Diabetes mellitus (DM) is a major public health concern. Identifying risk factors associated with DM (obesity, age, race, and gender) and targeting interventions for prevention are crucial. Logistic regression estimates the association between risk factors and binary outcomes (presence or absence of diabetes). However, classical maximum likelihood estimation (MLE) yields unstable results in small samples, with missing data, or under quasi- or complete separation. Standard analytical approaches are insufficient for analyzing the complexity of healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) (Zeger et al., 2020). A Bayesian hierarchical model with Markov chain Monte Carlo (MCMC) has been applied to multivariate longitudinal healthcare data, integrating prior knowledge to predict patient health status (Zeger et al., 2020). A model with two levels of data structure, (1) repeated measures over time within individuals and (2) individuals nested within a population, with added exogenous covariates (e.g., age, clinical history) and endogenous covariates (e.g., current treatment), yields posterior distributions of the parameters. MCMC estimation provides marginal distributions of the regression coefficients for risk prediction (pneumonia, prostate cancer, and mental disorders). The model’s limitation is its parametric nature, suggesting nonparametric or more flexible parametric alternatives.
Application of Bayesian inference. Chatzimichail and Hatjimihail (2023) compared parametric distributions (with a fixed set of parameters) and nonparametric distributions (which make no a priori assumptions) on National Health and Nutrition Examination Survey data from two separate diagnostic tests on both diseased and non-diseased populations, providing posterior probabilities for disease classification. Conventional methods based on clinical criteria and fixed numerical thresholds fail to capture the intricate relationship between diagnostic tests and disease prevalence; the dichotomous method (overlap of probability distributions between the diseased and non-diseased groups) fails to capture the complexity and heterogeneity across diverse populations, and its applicability in dealing with skewness, bimodality, or multimodality is critiqued. Integration of priors, combined with data from multiple diagnostic tests, improves diagnostic accuracy and precision. The Bayesian nonparametric approach (versus the parametric one) is flexible, adaptable, versatile, and robust, capturing complex data patterns and producing multimodal probability patterns versus the bimodal, double-sigmoidal curves of parametric models. Limited scholarly publications and over-dependence on priors limit model applicability. When combined with other statistical and computational techniques, it enhances diagnostic capabilities (Chatzimichail and Hatjimihail, 2023) in situations with systemic bias and unrepresentative, incomplete, or non-normal datasets.
The Bayesian methodology explained by Schoot et al. (2021) emphasizes the importance of priors, data modeling, inference, model checking, sampling techniques from a posterior distribution, variational inference, and variable selection for applicability across varied research fields (social sciences, ecology, genetics, and medicine). Variable selection is emphasized because multicollinearity, insufficient sampling, and overfitting result in poor predictive performance and difficult interpretation. Prior types (informative, weakly informative, and diffuse) are based on the degree of (un)certainty (hyperparameters) surrounding the population parameter; a larger variance represents greater uncertainty, and a mildly informative prior supports the analysis of small samples. Prior elicitation methods (experts, generic experts, data-based methods, and sample data using maximum likelihood or sample statistics), together with prior sensitivity analysis of the likelihood, assess how the priors and the likelihood align. Priors provide data-informed shrinkage, regularization, or influence on algorithms, yielding a high-density region and improving estimation. Specification of the priors, in small and less informative samples, strengthens the observed data in suggesting possible value(s) for the unknown parameter(s). With unknown parameters having varied values, observed data having fixed values, and the likelihood function generating a range of possible values, MCMC algorithms that sample values from a given distribution through computer simulation provide empirical estimates of the posterior distribution (brms and blavaan in R). The frequentist method does not consider the probability of the unknown parameters, treating them as fixed, while the likelihood is based on the conditional probability distribution p(y|θ) of the data (y) given fixed parameters (θ).
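The prior-to-posterior updating and shrinkage behavior described above can be illustrated with the simplest conjugate case: a Beta prior for a prevalence parameter updated by binomial data. The following is a minimal Python sketch with hypothetical counts (not NHANES data); a larger prior variance encodes greater uncertainty, exactly as discussed.

```python
# Conjugate Beta-Binomial updating: prior Beta(a, b) + binomial data -> posterior Beta(a+s, b+f).

def beta_update(a, b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return a + successes, b + failures

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

s, f = 12, 38  # hypothetical: 12 cases among 50 observations

# Diffuse prior Beta(1, 1) vs. informative prior Beta(20, 80) (prior mean 0.20)
for a, b in [(1, 1), (20, 80)]:
    pa, pb = beta_update(a, b, s, f)
    mean, var = beta_mean_var(pa, pb)
    print(f"prior Beta({a},{b}) -> posterior Beta({pa},{pb}), mean={mean:.3f}, var={var:.5f}")
```

The diffuse prior yields a posterior mean of 0.25 (driven entirely by the data), while the informative prior pulls the estimate toward its prior mean of 0.20: the data-informed shrinkage referred to above.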
Spatial and temporal variability factored into Bayesian models has varied applicability: large-scale cancer genomic data, identifying novel molecular-level changes and interactions between mutated genes, capturing mutational signatures, allowing genomics-based patient stratification (clinical trials, personalized use of therapeutics), and understanding cancer evolutionary processes. The Bayesian model is reproducible, but autocorrelation in the temporal model and subjectivity in prior elicitation are limitations.
Prior elicitation, analytical posteriors, and robustness checks in Bayesian normal linear regression, with a parametric (conjugate) model incorporating a Normal–Inverse-Gamma prior, have been demonstrated in metrology by Klauenberg et al. (2015) to calibrate instruments. In the Gaussian model, errors are independent and identically distributed, the variance is unknown, the relationship between X and Y is statistical, with noise and model uncertainty, and the regression cannot be treated as a measurement function. Methods such as likelihood, Bayesian, and bootstrap approaches quantify uncertainty, account for a priori information, and robustify analyses through prior elicitation and posterior calculation. Observables (data) and unobservables (parameters and auxiliary variables) are unknown and random, and the assigned probability distributions update prior knowledge about the unobservables, enhancing the elicitation and interpretation process. With a Normal-Inverse-Gamma (NIG) prior, the posterior belongs to the same NIG family (a conjugate prior), and limiting cases can yield vague or non-informative prior distributions. An alternative (hierarchical) prior provides an additional layer of distributions for the uncertainty.
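The conjugacy just described has a closed form. The following Python sketch applies the standard Normal-Inverse-Gamma update for an unknown mean and variance with made-up hyperparameters and data (not the metrology example), showing that the posterior stays in the NIG family:

```python
def nig_update(mu0, lam0, a0, b0, data):
    """Closed-form NIG(mu, lam, a, b) posterior for a Normal mean/variance
    (standard conjugate-update formulas)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)  # within-sample sum of squares
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * ss + (lam0 * n * (xbar - mu0) ** 2) / (2 * lam_n)
    return mu_n, lam_n, a_n, b_n

# A vague prior centred at 0, updated by three observations
mu_n, lam_n, a_n, b_n = nig_update(0.0, 1.0, 2.0, 2.0, [1.0, 2.0, 3.0])
print(mu_n, lam_n, a_n, b_n)  # 1.5 4.0 3.5 4.5
```

The posterior mean (1.5) is a precision-weighted compromise between the prior mean (0) and the sample mean (2), which is the "updating prior knowledge about the unobservables" mechanism described above.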
A Bayesian hierarchical / meta-analytic linear regression model (DeLeeuw, 2012) augments data by incorporating both exchangeable and unexchangeable information on regression coefficients (and standard errors) from previous studies, addressing the multiple testing associated with low statistical power, the issues of conducting separate significance tests for all regression coefficients in modest samples across studies with different predictors, and the need for larger samples. Linear regression without priors produces estimates that are unreliable and vulnerable to sampling variation. Where previous studies are unavailable, priors from meta-analysis in Bayesian regression address the challenge of large sample-size requirements, supplementing and resolving the limitations of univariate analyses that overlook the relationships among multiple regression parameters within a study. With prior specifications drawn from both prior data and current data, priors are categorized as (1) exchangeable, when the current data and previous studies share the same set of predictors, and (2) unexchangeable, when the predictors differ across studies. The model combines (1) the probability density function for the data (using the Gibbs sampler) and (2) the likelihood function reflecting prior assumptions about the model parameters before observing the data. The hierarchical unexchangeable model is applicable for studying differences between studies, enabling explicit testing of the exchangeability assumption, but its applicability is limited by the correlation issue of requiring an identical set of predictors (DeLeeuw, 2012).
Bayesian logistic regression (Bayesian GLM). A sequential clinical reasoning approach was applied in screening adults (20–79 years, Taiwan) to classify incident cancers and cardiovascular disease. Using three models with sequential addition of predictors, (1) demographic features (basic model), (2) six metabolic syndrome components (metabolic score model), and (3) conventional risk factors (enhanced model), and by incorporating priors, Liu (2013) demonstrated that the model could address the limited availability of molecular information and offers an alternative method that leverages routinely collected biological markers for disease classification. In contrast, the Framingham Risk Score is limited by geographic, ethnic, and social contextual heterogeneity across populations. Emulating a clinician’s evaluation process, the model assumes normally distributed regression coefficients, accounts for uncertainty in clinical weights, and averages credible intervals for predicted risk estimates. By updating prior clinical expectations with objective observed data (patient history and laboratory findings), the posterior distributions (enhanced model) showed that patient background significantly contributed to baseline risk estimation by integrating individual characteristics that capture ecological heterogeneity. The model’s limitations are potential interactions between predictors and the lack of external cross-validation. Bayesian multiple imputation with logistic regression (Austin, 2021) addresses missing data in clinical research.
Analyzing the reasons for missing values, such as (i) patients refusing to answer specific questions, (ii) loss to follow-up, (iii) investigator or mechanical errors, or (iv) physicians choosing not to order certain investigations, requires understanding the type of missingness: missing at random (MAR), missing not at random (MNAR), or missing completely at random (MCAR). Addressing missingness by multiple imputation (MI) (in R, SAS, or Stata) supported classifying patients with heart failure and providing estimates of 1-year mortality. MI generates plausible values, creating multiple completed datasets while conducting identical statistical analyses across them, and provides robust estimates through pooled results.
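The distinction between MCAR and MAR can be made concrete with a small simulation, a Python sketch with arbitrary rates unrelated to the heart-failure study: under MCAR the chance of a missing value is constant, while under MAR it depends on an observed covariate such as age.

```python
import random

random.seed(1)
n = 2000
ages = [random.uniform(20, 80) for _ in range(n)]

# MCAR: every value has the same 20% chance of being missing
mcar = [None if random.random() < 0.20 else a for a in ages]

# MAR: missingness depends on the *observed* age (older patients answer less often)
mar = [None if random.random() < (0.40 if a >= 50 else 0.05) else a for a in ages]

def miss_rate(values, keep):
    """Missingness rate among the subgroup selected by keep(age)."""
    sel = [v for v, a in zip(values, ages) if keep(a)]
    return sum(v is None for v in sel) / len(sel)

print("MCAR rate, young vs old:", miss_rate(mcar, lambda a: a < 50), miss_rate(mcar, lambda a: a >= 50))
print("MAR  rate, young vs old:", miss_rate(mar, lambda a: a < 50), miss_rate(mar, lambda a: a >= 50))
```

Under MCAR the two subgroup rates agree up to sampling noise; under MAR the missingness rate differs sharply by age, so a complete-case analysis would over-represent the young.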
Aims
The present study focuses on the application of Bayesian logistic regression to predict diabetes status based on body mass index (BMI), age, gender, and race as predictors, using a retrospective dataset (2013–2014 NHANES survey). The dataset presents challenges such as quasi-separation, missing values, and a relatively small effective sample size, and traditional logistic regression has limitations in dealing with these anomalies. Initial data exploration yielded 9,813 observations across five selected variables. Complete case analysis (listwise deletion) substantially reduced the sample to only 14 complete cases and presented quasi-separation, with implausibly large coefficients and unstable estimates. These analytic limitations of traditional logistic regression motivate us to perform Multiple Imputation by Chained Equations (MICE) in conjunction with Bayesian logistic regression. The approach provides a flexible framework for modeling uncertainty, incorporating prior knowledge, and addressing issues related to quasi-separation and limited sample size.
Method and Data Preparation
The statistical tool used is R, with R packages and libraries to import, manage, and analyze the data. The data come from the NHANES 2-year cross-sectional cycle (2013–2014), using 3 datasets (demographics, exam, questionnaire) (Center for Health Statistics, 1999). Using the haven package, .XPT files were imported into RStudio and converted to a data frame (df).
Subsets created from the original weighted 3 datasets (demographics, exam, questionnaire) were merged into a single dataframe for analysis and exploration. The merged dataframe included the selected variables of interest and was cleaned; variables were categorized, recoded, and analyzed. Basic statistics, anomalies, and patterns are reported before running the Bayesian regression. The final dataset included weighted means of all selected variables of interest. The table below describes all variables in the data.
The response variable (binary, Diabetes) was defined from the question: “Is there one Dr you see for diabetes”
Predictor variable (Body Mass Index, factor, 4 levels)
The original data has BMDBMIC (measured BMI) as categorical and had no missing values. BMI has the following 4 levels:
o Underweight (<5th percentile)
o Normal (5th–<85th)
o Overweight (85th–<95th)
o Obese (≥95th percentile)
o Missing
We kept it as is, since the categorization provides clinically interpretable groups.
Covariates:
Gender (factor, 2 levels): Male, Female
Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
Age (num, continuous)
Code
library(gt)

# Formation of table with variable details
variables <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
descriptions <- c(
  "Respondent sequence number",
  "BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.",
  "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.",
  "Gender",
  "Race/ethnicity Recode of reported race and Hispanic origin information",
  "Sample PSU",
  "Sample strata",
  "MEC exam weight",
  "Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors."
)
df <- data.frame(Variable = variables, Description = descriptions)

df %>%
  gt() %>%
  tab_header(title = "Table Variable Description") %>%
  tab_footnote(footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team.")
Table Variable Description

| Variable | Description |
|----------|-------------|
| ID | Respondent sequence number |
| BMI | BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place. |
| Age | Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age. |
| Gender | Gender |
| Race | Race/ethnicity Recode of reported race and Hispanic origin information |
| PSU | Sample PSU |
| Strata | Sample strata |
| Weight | MEC exam weight |
| Diabetes | Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors. |

Each variable in the dataset, accompanied by a qualitative description from the study team.
The data structure, a plot of the data, and the breakdown of the missingness are presented below.
Code
# Weighted means of each variable
str(merged_data)
'data.frame': 9813 obs. of 9 variables:
$ ID : num 73557 73558 73559 73560 73561 ...
$ BMI : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
$ Age : num 69 54 72 9 73 56 0 61 56 65 ...
$ Gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
$ Race : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
$ PSU : num 1 1 1 2 2 1 1 1 1 2 ...
$ Strata : num 112 108 109 109 116 111 105 114 112 112 ...
$ Weight : num 13481 24472 57193 55767 65542 ...
$ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
mean SE
RaceMexican American 0.110450 0.0199
RaceOther Hispanic 0.060637 0.0106
RaceNon-Hispanic White 0.622315 0.0365
RaceNon-Hispanic Black 0.120895 0.0173
RaceOther Race - Including Multi-Racial 0.085704 0.0076
Using library(survey), we extract the weighted means and standard errors of the variables from the data (9,813 observations).
Data preprocessing and cleaning steps
Special codes in the survey are not random and cannot simply be dropped: if ignored, the informative missingness (MAR or MNAR) could introduce bias. Based on the variable codebook, special codes were recoded to NA. All rows with NAs were retained in the dataset, since R automatically excludes rows with NA during regression.
We conducted linear regression with lm(), and listwise deletion (complete case analysis) resulted in a reduced sample size (n = 14).
We observed quasi-separation (warning) on our dataset.
The results with all NAs removed and the breakdown of the missingness are presented below.
Descriptive statistics (counts, frequencies, proportions, mean, and sd) are reported. Visualizations of each variable and the proportions of the variables are presented here.
Below are frequency plots for both continuous and categorical variables.
A tabular view of counts and proportions of all variables is created.
Yes No <NA>
Mexican American 0.8763885 0.4178131 15.8768980
Other Hispanic 0.4891470 0.1528585 8.8352186
Non-Hispanic White 2.0992561 0.5706716 33.3842862
Non-Hispanic Black 1.4878223 0.3668603 20.5441761
Other Race - Including Multi-Racial 0.6827678 0.2140018 14.0018343
Yes No <NA>
Male 2.6903088 0.8865790 45.6537247
Female 2.9450729 0.8356262 46.9886885
Code
## Bar plot of Age, gender, race, diabetes status, BMI ##
plot_bar(merged_data, title = "Figure 3 (Merged dataset). Frequency plots of categorical variables.")
Method
We performed a Bayesian logistic regression analysis, chosen to match the characteristics of the data:
- Binary outcome: Diabetes (yes/no), a discrete variable that can take two values, 0 or 1 (Bernoulli probability model), treated as binomial.
- Posterior probabilities are obtained via Bayes' rule.
- A linear relation is assumed between the predictors (X) and the log-odds of the response (Y).
- Classification task: regression on a categorical response variable, predicting or classifying the response category.
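The Bernoulli/logistic setup above links the linear predictor to a probability through the inverse-logit function. A minimal Python sketch with made-up coefficient values (not our fitted model):

```python
from math import exp, log

def inv_logit(eta):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))

# Hypothetical linear predictor: intercept + beta_age * age
b0, b_age = -3.0, 0.04
for age in (20, 50, 80):
    eta = b0 + b_age * age  # log-odds
    p = inv_logit(eta)      # P(Y = 1 | age)
    print(age, round(p, 3))

# Bernoulli log-likelihood of a single observation y in {0, 1}
def bern_loglik(y, p):
    return y * log(p) + (1 - y) * log(1 - p)
```

Maximizing the sum of `bern_loglik` over observations gives the MLE; the Bayesian model instead combines this likelihood with priors on the coefficients to obtain a posterior.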
We later compare the two modeling approaches:
Frequentist methods: multiple logistic regression model, baseline regression model, and Firth (penalized) regression model
Bayesian logistic regression model
Below are the principles, concepts, and assumptions of the two approaches.
Multiple logistic regression
Multiple logistic regression on the raw dataset resulted in a small sample size after complete case analysis (listwise deletion of NAs); the first 6 deleted rows are presented.
The data resulted in quasi-separation (van Buuren, 2012).
Exploring the causes of missingness revealed missing at random and missing not at random (MAR and MNAR), which, as reported previously, are common in healthcare and public health datasets (see the plot on missingness below).
Baseline regression model (Only BMI to predict diabetes)
We conducted a baseline model regression to determine whether the predictors significantly improve predictive power.
Null deviance = 16.75 (baseline fit).
Residual deviance = 15.11 (with BMI).
This shows that the drop (16.75 → 15.11) is small and that BMI category adds very little predictive value over simply assuming the overall diabetes prevalence.
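The size of this deviance drop can be judged with a likelihood-ratio test: under the null, the drop is approximately chi-square distributed. A Python check, assuming 3 degrees of freedom for the three non-reference BMI categories (the closed-form survival function below is valid only for df = 3):

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df3(x):
    """Survival function of a chi-square with 3 degrees of freedom (closed form)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

null_dev, resid_dev = 16.75, 15.11  # deviances reported above
drop = null_dev - resid_dev         # = 1.64 on 3 df
p_value = chi2_sf_df3(drop)
print(f"deviance drop = {drop:.2f}, p = {p_value:.2f}")
```

A p-value of about 0.65 is consistent with the conclusion that BMI category adds little beyond the overall prevalence.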
The anomalies in the data were handled as follows: quasi-separation by performing Firth regression, and the small dataset due to listwise deletion by Multivariate Imputation by Chained Equations (MICE).
Concept and results of Firth regression and MICE presented below.
Firth (penalized) regression
Firth (penalized) regression was considered to handle quasi-separation (D’Angelo and Ran, 2025). Firth regression is a frequentist approach that uses the Jeffreys prior for bias correction. Unlike Bayesian logistic regression, it does not provide a posterior distribution and involves no MCMC sampling.
Below are the summaries from the MLR (raw data), Baseline model regression, Firth regression
Data exploration of the raw dataset
Unexpected reports, patterns or anomalies in the raw data
Issue of quasi-complete separation in the data (9799 observations dropped)
Reduced sample size with reduced number of complete cases (n=14).
The model is overfitted to this subset and cannot be generalized.
Huge coefficients (e.g., 94, –50, 73) and the tiny residual deviance suggest perfect separation and sparse data in some categories with very few observations, resulting in imbalance in the outcome (very few cases of 0 or 1).
Logistic regression cannot estimate stable coefficients when predictors perfectly classify the outcome.
Firth regression dealt with quasi-separation by keeping the coefficients finite, but the sample size remained reduced (n = 14), so the estimates are highly uncertain, with wide confidence intervals; we cannot make strong claims about predictor effects.
The missingness is multivariate and non-monotone (arbitrary) with a connected pattern, since all variables were observed in at least some cases.
To estimate a correlation coefficient between two variables, they need to be connected, either directly by a set of cases that have scores on both, or indirectly through their relation with a third set of connected data.
Interpretation of the 14 non-missing cases:
- Estimates could not be trusted because of the small sample size and the separation problem.
- Models with all predictors together and with sequential addition of predictors all showed unstable and extreme estimates, with standard errors that were not meaningful.
- Adding more predictors makes the deviance drop, but this indicates overfitting/separation, not true explanatory power.
- BMI alone contributes very little.
- Race and gender make the models appear stronger, but this rests on a small sample (n = 14) exhibiting complete separation, not generalizable evidence.
We decided to perform imputation to retain the full N ≈ 9,813, dealing with the small sample size and avoiding quasi-separation.
Multivariate Imputation by Chained Equations (MICE) (Buuren and Groothuis-Oudshoorn, 2011): a Bayesian approach
Multiple imputation (MI) is considered as an alternative approach for the given raw dataset. Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even for high percentages (75%) of missing data in one variable (van Buuren, 2012).
Multiple imputation is a Bayesian approach (we use the popular mice package in R) that adds sampling variability to the imputations.
Iterative Imputation (MICE) imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
This is a chain process, with each imputed variable becoming a predictor for the subsequent imputation and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
Each variable is imputed using its own appropriate univariate regression model.
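The chained mechanism described above can be sketched in a few lines. This is a deliberately minimal Python illustration using linear-regression imputation for two variables with made-up data; the actual analysis uses the mice package's predictive mean matching and adds Bayesian sampling variability, which this sketch omits.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for ys ~ xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def chained_impute(x, y, rounds=5):
    """Impute x from y, then y from x, repeatedly (the MICE 'chain')."""
    x, y = list(x), list(y)
    mx_idx = [i for i, v in enumerate(x) if v is None]
    my_idx = [i for i, v in enumerate(y) if v is None]
    # Step 0: mean imputation as a starting point
    for v, idx in ((x, mx_idx), (y, my_idx)):
        obs = [v[i] for i in range(len(v)) if i not in idx]
        m = sum(obs) / len(obs)
        for i in idx:
            v[i] = m
    # Chained rounds: re-impute each variable from a regression on the other,
    # with freshly imputed values feeding the next step
    for _ in range(rounds):
        a, b = fit_line(y, x)
        for i in mx_idx:
            x[i] = a + b * y[i]
        a, b = fit_line(x, y)
        for i in my_idx:
            y[i] = a + b * x[i]
    return x, y

x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.1, 3.9, 6.1, None, 10.1]   # roughly y = 2x, with one gap in each variable
xi, yi = chained_impute(x, y)
print(xi, yi)
```

The imputed values settle near the values the y ≈ 2x pattern would predict, illustrating how each variable borrows strength from the others through the chain.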
Code
## Multiple Imputation performed
# Subset variables for imputation in analytic_data df
library(dplyr)
library(ggplot2)
library(mice)
library(VIM)
library(janitor)

# 1. Select variables for imputation
vars <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
analytic_data <- merged_data[, vars]
glimpse(analytic_data)
# 2. Run mice to create 5 imputed datasets
imputed_data <- mice(
  analytic_data,
  m = 5,          # number of imputed datasets
  method = 'pmm', # predictive mean matching
  seed = 123
)
ID BMI Age Gender
Min. :73557 Underweight : 218 Min. : 0.00 Male :4831
1st Qu.:76092 Normal weight:5107 1st Qu.:10.00 Female:4982
Median :78643 Overweight :1831 Median :27.00
Mean :78645 Obese :2657 Mean :31.63
3rd Qu.:81191 3rd Qu.:52.00
Max. :83731 Max. :80.00
Race PSU Strata
Mexican American :1685 Min. :1.000 Min. :104.0
Other Hispanic : 930 1st Qu.:1.000 1st Qu.:107.0
Non-Hispanic White :3538 Median :1.000 Median :111.0
Non-Hispanic Black :2198 Mean :1.483 Mean :110.9
Other Race - Including Multi-Racial:1462 3rd Qu.:2.000 3rd Qu.:115.0
Max. :2.000 Max. :118.0
Weight Diabetes
Min. : 3748 Yes:7252
1st Qu.: 13187 No :2561
Median : 20965
Mean : 31713
3rd Qu.: 37922
Max. :171395
Code
colSums(is.na(Imputed_data1))
ID BMI Age Gender Race PSU Strata Weight
0 0 0 0 0 0 0 0
Diabetes
0
Code
plot_intro(Imputed_data1, title="Figure 8. Structure of variables and missing observations.")
Code
plot_bar(Imputed_data1, title ="Figure 9. Frequency plots of categorical variables.")
Code
plot_correlation(na.omit(Imputed_data1[, c("BMI", "Diabetes")]), maxcat = 5L, title = "Figure")

# Cross-tabulation: BMI vs Diabetes
tab_BMI <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab_BMI)
Yes No
Underweight 159 59
Normal weight 3760 1347
Overweight 1342 489
Obese 1991 666
Code
prop.table(tab_BMI, 1) * 100  # row percentages
Yes No
Underweight 72.93578 27.06422
Normal weight 73.62444 26.37556
Overweight 73.29328 26.70672
Obese 74.93414 25.06586
Code
# Gender vs Diabetes
tab_gender <- table(Imputed_data1$Gender, Imputed_data1$Diabetes)
prop.table(tab_gender, 1) * 100
Yes No
Male 72.92486 27.07514
Female 74.84946 25.15054
Code
# Race vs Diabetes
tab_race <- table(Imputed_data1$Race, Imputed_data1$Diabetes)
prop.table(tab_race, 1) * 100
Yes No
Mexican American 71.03858 28.96142
Other Hispanic 74.40860 25.59140
Non-Hispanic White 76.17298 23.82702
Non-Hispanic Black 72.47498 27.52502
Other Race - Including Multi-Racial 73.52941 26.47059
Code
# Age vs Diabetes
tab_age <- table(Imputed_data1$Age, Imputed_data1$Diabetes)
head(prop.table(tab_age, 1) * 100)
# A tibble: 8 × 4
# Groups: BMI [4]
BMI Diabetes Count Percent
<fct> <fct> <int> <dbl>
1 Underweight Yes 159 72.9
2 Underweight No 59 27.1
3 Normal weight Yes 3760 73.6
4 Normal weight No 1347 26.4
5 Overweight Yes 1342 73.3
6 Overweight No 489 26.7
7 Obese Yes 1991 74.9
8 Obese No 666 25.1
Code
# 6. Frequency tables for categorical variables
categorical_vars <- c("BMI", "Gender", "Race", "Diabetes")
for (var in categorical_vars) {
  cat("\nFrequency table for", var, ":\n")
  print(table(Imputed_data1[[var]]))
  print(round(prop.table(table(Imputed_data1[[var]])), 3))
}
Frequency table for BMI :
Underweight Normal weight Overweight Obese
218 5107 1831 2657
Underweight Normal weight Overweight Obese
0.022 0.520 0.187 0.271
Frequency table for Gender :
Male Female
4831 4982
Male Female
0.492 0.508
Frequency table for Race :
Mexican American Other Hispanic
1685 930
Non-Hispanic White Non-Hispanic Black
3538 2198
Other Race - Including Multi-Racial
1462
Mexican American Other Hispanic
0.172 0.095
Non-Hispanic White Non-Hispanic Black
0.361 0.224
Other Race - Including Multi-Racial
0.149
Frequency table for Diabetes :
Yes No
7252 2561
Yes No
0.739 0.261
Code
# 7. Summary statistics for continuous variables
continuous_vars <- c("Age")
for (var in continuous_vars) {
  cat("\nSummary statistics for", var, ":\n")
  print(summary(Imputed_data1[[var]]))
  print(paste("SD:", round(sd(Imputed_data1[[var]]), 2)))
}
Summary statistics for Age :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 10.00 27.00 31.63 52.00 80.00
[1] "SD: 24.4"
Code
# 8. Bar plots for categorical variables
for (var in categorical_vars) {
  p <- ggplot(Imputed_data1, aes_string(x = var)) +
    geom_bar(fill = "steelblue") +
    labs(title = paste("Bar plot of", var), y = "Count") +
    theme_minimal()
  print(p)
}
Code
# 9. Histograms for continuous variables
for (var in continuous_vars) {
  p <- ggplot(Imputed_data1, aes_string(x = var)) +
    geom_histogram(fill = "skyblue", color = "black", bins = 30) +
    labs(title = paste("Histogram of", var), y = "Frequency") +
    theme_minimal()
  print(p)
}
Code
# 10. Scatter plot example (BMI vs Age)
ggplot(Imputed_data1, aes(x = Age, y = BMI, color = BMI)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter plot of BMI vs Age", y = "BMI", x = "Age") +
  theme_minimal()
Code
# 11. Relative breakdown of Diabetes by BMI
ggplot(breakdown_BMI, aes(x = BMI, y = Percent, fill = Diabetes)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Relative Breakdown of Diabetes by BMI Category",
    x = "BMI Category", y = "Proportion"
  ) +
  theme_minimal()
Code
# 12. Crosstab with percentages
Imputed_data1 %>%
  tabyl(Diabetes, BMI) %>%
  adorn_percentages("col")
Diabetes Underweight Normal weight Overweight Obese
Yes 0.7293578 0.7362444 0.7329328 0.7493414
No 0.2706422 0.2637556 0.2670672 0.2506586
Code
# 13. Margin plot for BMI vs Diabetes
marginplot(
  Imputed_data1[, c("BMI", "Diabetes")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)
Results from MICE:
MI filled in the missing values, yielding 9,813 observations with no NAs. A comparative bar plot of missingness in the raw data and the imputed data is presented below.
A heatmap of the imputed dataset shows the correlation between BMI categories and Diabetes status; the BMI dummy variables are strongly negatively correlated with one another.
There was no strong linear association between BMI category and diabetes in the dataset. A chi-square test of the categorical variables gave a p-value of 0.5461, which is > 0.05, providing no evidence of association. Checking the imputed data in the margin plot, the distribution of imputed points is consistent with the observed data (no strange outliers).
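The reported chi-square p-value can be reproduced directly from the BMI × Diabetes counts tabulated above (a quick Python check; the closed-form survival function applies because a 4 × 2 table has (4−1)(2−1) = 3 degrees of freedom):

```python
from math import erfc, exp, pi, sqrt

# Observed counts from the imputed data (rows: BMI category; columns: Yes, No)
table = {
    "Underweight":   (159, 59),
    "Normal weight": (3760, 1347),
    "Overweight":    (1342, 489),
    "Obese":         (1991, 666),
}

n = sum(yes + no for yes, no in table.values())          # 9813
col_tot = [sum(r[j] for r in table.values()) for j in (0, 1)]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = 0.0
for yes, no in table.values():
    row_tot = yes + no
    for obs, ct in ((yes, col_tot[0]), (no, col_tot[1])):
        expd = row_tot * ct / n
        chi2 += (obs - expd) ** 2 / expd

def chi2_sf_df3(x):
    """Chi-square survival function for df = 3 (closed form)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

print(round(chi2, 3), round(chi2_sf_df3(chi2), 4))  # approx 2.13 and 0.546
```

The statistic is about 2.13 on 3 df, reproducing the reported p ≈ 0.546 and confirming the absence of evidence for an association.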
Comparison of multiple imputation and Bayesian data augmentation

| Multiple imputation | Bayesian data augmentation |
|---------------------|----------------------------|
| Frequentist approach; requires no priors; moderate flexibility | Performs missing data imputation and regression model fitting simultaneously |
| Handles missing values first by imputation, then performs regression analysis and pools results | Markov Chain Monte Carlo (MCMC) draws samples from the joint posterior of regression parameters and missing values, providing complete datasets by extracting posterior means, credible intervals, and probabilities |
| Performed on the data with missingness | Shrinks extreme estimates back toward plausible values |
| Uncertainty is propagated after analysis (pooling) | Uncertainty in missing values is fully propagated through the model; naturally handles small or sparse datasets and separation problems |
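The "pools results" step on the multiple-imputation side follows Rubin's rules: the pooled estimate is the mean across the m completed datasets, and the total variance combines within- and between-imputation variance. A Python sketch with made-up per-dataset results (not our fitted coefficients):

```python
# Rubin's rules for pooling m = 5 imputed-data analyses (illustrative numbers).
ests = [0.50, 0.60, 0.55, 0.52, 0.58]   # coefficient estimate from each completed dataset
vars_ = [0.04, 0.04, 0.04, 0.04, 0.04]  # its squared standard error in each dataset

m = len(ests)
qbar = sum(ests) / m                              # pooled estimate
W = sum(vars_) / m                                # within-imputation variance
B = sum((q - qbar) ** 2 for q in ests) / (m - 1)  # between-imputation variance
T = W + (1 + 1 / m) * B                           # total variance

print(qbar, W, B, T)
```

The total variance T exceeds the average within-dataset variance W; that inflation is precisely the "uncertainty added after analysis (pooling)" noted in the comparison.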
# Log-odds (link)
Imputed_data1$logit <- predict(m_imp, type = "link")  # log(odds)

# Probability
Imputed_data1$prob <- exp(Imputed_data1$logit) / (1 + exp(Imputed_data1$logit))

# Plot predicted probability vs Age
ggplot(Imputed_data1, aes(x = Age, y = prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Predicted Probability of Diabetes")
Findings from regression of MI data set
MLR on imputed data (Frequentist approach)
The relationship between Age and the log-odds of diabetes is roughly, though not perfectly, linear, which is acceptable for the logistic regression assumptions.
The generalized variance inflation factor (GVIF-adjusted) reports no collinearity between the predictors (GVIF ≈ 1.0–1.04), so we can run the model without removing, dropping, or combining variables.
From Cook's distance and influential points, we found that most data points are safe and do not influence the model. In the data (~9,813 cases), the cutoff ≈ 0.0004. A cluster at high leverage shows unusual predictor values but not high influence. A few points above the Cook's distance cutoff are worth checking individually but pose no major threat to model stability. No outliers were detected (n = 9,813).
The Hosmer–Lemeshow (H–L) test at alpha = 0.05 gave p < 0.001, indicating that our logistic regression model does not fit the data well.
The graph shows residuals vs fitted values (imputed-data model).
Results from Bayesian Data Augmentation and logistic regression
We incorporate prior knowledge that BMI increases the odds of diabetes.
We use priors for the Bayesian logistic regression and compare models with different priors:
Prior (intercept): we use an intercept prior from this study's data.
Prior (coefficients): BMI, Age, Gender.
Weak prior: β_BMI ∼ N(0, 2.5²), i.e., a normal prior with standard deviation 2.5.
A common approach is to use a normal distribution, β_BMI ∼ N(μ, σ²), for the regression coefficient.
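What a weak N(0, 2.5²) prior on a log-odds coefficient means substantively can be checked by translating its central 95% interval to the odds-ratio scale (a quick Python computation, assuming 2.5 is the prior standard deviation):

```python
from math import exp

sd = 2.5                        # prior standard deviation on the log-odds scale
lo, hi = -1.96 * sd, 1.96 * sd  # central 95% prior interval for beta_BMI
print(f"beta in ({lo:.2f}, {hi:.2f}) -> odds ratio in ({exp(lo):.4f}, {exp(hi):.1f})")
```

The prior keeps the odds ratio between roughly 0.007 and 134: it rules out only absurd effect sizes while letting the data dominate, which is the intended behavior of a weakly informative prior.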
# Residual vs fitted for imputed data
plot(m_imp$fitted.values, resid(m_imp),
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)
A Hosmer–Lemeshow test was conducted to assess the goodness of fit of the multivariate logistic regression model, along with adjR2(m_imp), a chi-square test, and a visualization of the model (fitted vs residual values).
To overcome the quasi-separation issue in the data, a Firth penalized logistic regression was conducted; its summary, based on only 14 complete observations, is presented below.
Next, to deal with the small sample size, multiple imputation (MICE) was conducted, followed by a regression predicting Diabetes ~ BMI + Age + Gender + Race.
A summary and visualization of one of the five datasets produced by MICE is presented here, along with the predicted values from the imputed model (m_imp) and cross-tabulation plots between the predictors and the response variable (Diabetes).
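The imputation step described above (its code is not shown in this section) follows the standard mice workflow. A minimal sketch, assuming merged_data holds the analysis variables:

```r
# Multiple imputation with MICE, then pooled logistic regression (sketch)
library(mice)

# Generate 5 imputed datasets (m = 5 is the mice default)
imp <- mice(merged_data, m = 5, seed = 123, printFlag = FALSE)

# Fit the logistic model in each imputed dataset and pool with Rubin's rules
fits <- with(imp, glm(Diabetes ~ BMI + Age + Gender + Race, family = binomial))
summary(pool(fits))

# Extract one completed dataset (as used for Imputed_data1 above)
Imputed_data1 <- complete(imp, 1)
```

Pooling via pool() combines estimates across the five datasets, whereas the analyses above work with a single completed dataset.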
Code
# Frequentist logistic regression on raw data:
# Firth logistic regression (penalized regression)
library(logistf)
m_firth <- logistf(Diabetes ~ BMI + Age + Gender + Race, data = merged_data)
summary(m_firth)
logistf(formula = Diabetes ~ BMI + Age + Gender + Race, data = merged_data)
Model fitted by Penalized ML
Coefficients:
coef se(coef) lower 0.95
(Intercept) 2.2909182 2.0520292 -1.7059362
BMIOverweight 2.5151817 1.9838969 -1.2146946
BMIObese -0.4519637 1.8778881 -5.4843160
Age -0.2545704 0.1911568 -1.4579663
GenderFemale -2.1675543 2.0190594 -17.0316807
RaceOther Hispanic 0.9518795 1.9225099 -12.0194116
RaceNon-Hispanic White -2.5671499 2.2189914 -21.6563170
RaceNon-Hispanic Black 3.3280688 2.2005227 -0.4678607
RaceOther Race - Including Multi-Racial 1.3072954 2.0609641 -2.6483209
upper 0.95 Chisq p method
(Intercept) 16.1697706 1.25690744 0.26223728 2
BMIOverweight 21.9385716 1.68994830 0.19360778 2
BMIObese 3.4687761 0.05538832 0.81393924 2
Age 0.1148471 1.81192970 0.17827692 2
GenderFemale 5.9967059 0.69556457 0.40427809 2
RaceOther Hispanic 10.8617660 0.20246895 0.65273532 2
RaceNon-Hispanic White 1.6700624 1.36287409 0.24304000 2
RaceNon-Hispanic Black 25.3575974 2.86971841 0.09026066 2
RaceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103 2
Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None
Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677
Code
# Imputed data plots: predicted probabilities
Imputed_data1 <- Imputed_data1 %>%
  mutate(pred_prob_imputed = predict(m_imp, type = "response"))  # predicted probabilities

Imputed_plot <- Imputed_data1 %>%
  select(BMI, pred_prob_imputed) %>%
  mutate(Source = "Imputed") %>%
  rename(Pred_Prob = pred_prob_imputed)  # rename probability column to common name

ggplot(Imputed_data1, aes(x = BMI, y = pred_prob_imputed, fill = BMI)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "BMI Category", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by BMI Category") +
  theme_minimal()
Code
merged_data_clean <- merged_data %>%
  filter(!is.na(BMI), !is.na(Diabetes))

ggplot(merged_data_clean, aes(x = BMI, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "BMI Category", y = "Proportion with Diabetes", fill = "Diabetes",
       title = "Observed Diabetes Proportions by BMI Category") +
  theme_minimal()
Code
ggplot(Imputed_data1, aes(x = Race, y = pred_prob_imputed, fill = Race)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "Race", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by Race") +
  theme_minimal()
Code
merged_data_clean_Race <- merged_data %>%
  filter(!is.na(Race), !is.na(Diabetes))

ggplot(merged_data_clean_Race, aes(x = Race, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race", y = "Proportion with Diabetes", fill = "Diabetes",
       title = "Observed Diabetes Proportions by Race") +
  theme_minimal()
The proportion of diabetes status by age group (<40 and >=40) is tabulated below.
Code
ggplot(merged_data, aes(x = Age, y = Diabetes)) +geom_jitter(size =0.2)
Code
ggplot(Imputed_data1, aes(x = Age, y = Diabetes)) +geom_jitter(size =0.2)
Code
# Create age groups and a contingency table with Diabetes
Imputed_data1$Age_group <- ifelse(Imputed_data1$Age < 40, "<40", ">=40")
tab_age <- table(Imputed_data1$Age_group, Imputed_data1$Diabetes)
prop_age <- prop.table(tab_age, 1) * 100
tab_age
Yes No
<40 4414 1691
>=40 2838 870
Code
prop_age
Yes No
<40 72.30139 27.69861
>=40 76.53722 23.46278
Code
# Convert table to data frame
df_age <- as.data.frame(tab_age)
names(df_age) <- c("Age_group", "Diabetes", "Count")  # rename columns
Code
## Reference: Gelman et al., 2008, weakly informative priors:
## Normal(0, 2.5) for coefficients (b) and Normal(0, 5) for the intercept
## as default weakly informative priors for logistic regression

# Bayesian logistic regression
library(brms)
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),       # for coefficients
  set_prior("normal(0, 5)", class = "Intercept")  # for intercept
)
formula_bayes <- bf(Diabetes ~ Age + BMI + Gender + Race)
Diabetes_prior <- brm(
  formula = formula_bayes,
  data = Imputed_data1,
  family = bernoulli(link = "logit"),  # logistic regression
  prior = priors,
  chains = 4,
  iter = 2000,
  seed = 123,
  control = list(adapt_delta = 0.95)
)
Family: bernoulli
Links: mu = logit
Formula: Diabetes ~ Age + BMI + Gender + Race
Data: Imputed_data1 (Number of observations: 9813)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept -0.77 0.16 -1.09 -0.44 1.00
Age -0.00 0.00 -0.01 -0.00 1.00
BMINormalweight 0.03 0.16 -0.27 0.34 1.00
BMIOverweight 0.08 0.16 -0.24 0.40 1.00
BMIObese 0.03 0.16 -0.29 0.35 1.00
GenderFemale -0.09 0.05 -0.18 -0.00 1.00
RaceOtherHispanic -0.15 0.09 -0.34 0.03 1.00
RaceNonMHispanicWhite -0.21 0.07 -0.34 -0.08 1.00
RaceNonMHispanicBlack -0.06 0.07 -0.20 0.09 1.00
RaceOtherRaceMIncludingMultiMRacial -0.11 0.08 -0.26 0.05 1.00
Bulk_ESS Tail_ESS
Intercept 1834 1902
Age 4495 2737
BMINormalweight 1820 1960
BMIOverweight 1899 2263
BMIObese 1884 1952
GenderFemale 3564 2756
RaceOtherHispanic 2685 2528
RaceNonMHispanicWhite 2156 2917
RaceNonMHispanicBlack 2444 2590
RaceOtherRaceMIncludingMultiMRacial 2311 2896
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
plot(Diabetes_prior)
Code
## Draws: generate fitted draws directly with brms
fitted_draws <- fitted(
  Diabetes_prior,
  newdata = Imputed_data1,
  summary = FALSE,  # gives all posterior draws instead of a summary
  nsamples = 100    # limit to 100 draws
)

# Convert to long format manually
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

### BMI: plot the fitted lines
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Age
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Race
fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Gender
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")
Code
### Age cut by 10 years: Diabetes plot and histogram (imputed data)
Imputed_data1 %>%
  mutate(age_bracket = cut(Age, breaks = seq(10, 100, by = 10))) %>%
  group_by(age_bracket) %>%
  summarise(Diabetes = mean(Diabetes == "Yes")) %>%
  ggplot(aes(x = age_bracket, y = Diabetes)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Code
### Predicted values from 100 draws: simulated predictions (binary diabetes outcome)
pred <- posterior_predict(Diabetes_prior, newdata = Imputed_data1, draws = 100)

# Data frame for summarizing
pred_df <- as.data.frame(t(pred))

# Proportion of diabetes = 1 per draw
prop_diabetes <- colMeans(pred_df == 1)
prop_df <- tibble(
  draw = 1:length(prop_diabetes),
  proportion_Diabetes = prop_diabetes  # proportion of Diabetes per draw
)

library(ggplot2)
ggplot(prop_df, aes(x = proportion_Diabetes)) +
  geom_histogram(color = "white")
Bayesian logistic regression model: a weakly informative prior was used; compilation, iterations, and posterior draws were obtained via NUTS sampling. Fitted draws from the model posterior (n = 100) were analyzed, and estimates and Rhat values were examined for convergence. Plots are presented below: a histogram of predicted values (100 draws) showing the observed proportion of diabetes status, and a scatterplot of the proportion of diabetes grouped by age.
We created a posterior model from the 100 posterior draws and analyzed our simulated prior model, plotting posterior predicted values of diabetes against Age, Race, and BMI.
Code
library(brms)
library(GGally)

# Refit the model, also sampling from the priors
Diabetes_model_1 <- update(
  Diabetes_prior,
  sample_prior = "yes"  # includes priors + data likelihood
)
# BMI: posterior fitted values (probabilities of Diabetes)
fitted_draws <- fitted(
  Diabetes_model_1,  # <-- use posterior model here
  newdata = Imputed_data1,
  summary = FALSE,   # full posterior draws
  nsamples = 100     # sample 100 posterior draws
)

# Reshape into long format
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines (BMI)
library(ggplot2)
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by BMI")
Code
# Age: reshape into long format
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Age")
Code
# Race: reshape into long format
fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Race")
Code
# Gender: reshape into long format
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(size = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes",
       title = "Posterior fitted draws by Gender")
We performed Bayesian logistic regression to model the probability of diabetes as a function of age, BMI, gender, and race.
We used weakly informative normal priors (Gosho et al. 2025) on all regression coefficients. Models were fit using four MCMC chains of 2,000 iterations each, including 1,000 warm-up iterations. Convergence was assessed using Rhat values, effective sample sizes, and trace plots. Posterior predictive checks were used to evaluate model fit, and coefficients were exponentiated to report odds ratios with 95% credible intervals.
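The exponentiation step operates directly on posterior draws. A base-R sketch, with simulated draws standing in for the brms posterior (in practice the draws would come from the fitted model, e.g. via as_draws_df()):

```r
# Odds ratio and 95% credible interval from posterior draws of a coefficient
# (draws are simulated here to stand in for the brms posterior)
set.seed(3)
beta_draws <- rnorm(4000, mean = -0.21, sd = 0.07)  # shaped like a coefficient above

or_draws <- exp(beta_draws)                    # transform each draw to an OR
or_est <- median(or_draws)                     # posterior median OR
or_ci  <- quantile(or_draws, c(0.025, 0.975))  # 95% credible interval

round(c(OR = or_est, or_ci), 2)
```

Because the transformation is applied draw by draw, the credible interval for the odds ratio is exact given the posterior sample, with no delta-method approximation.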
The cluster of points representing the higher probability of diabetes appears to be denser among individuals in the middle to older Age ranges (e.g., roughly from 40 to 80 years old), compared to the younger Age ranges, although diabetes is present even at younger Ages.
Comparing Models
Firth penalized logistic regression on raw data
Multivariate logistic regression on imputed dataset (MI + MLR)
Bayesian logistic regression on imputed data. The spread of these lines indicates the variability, or uncertainty, in the predicted probabilities within each BMI group (posterior model). The plots visually demonstrate the well-established relationship between BMI and the predicted probability of diabetes, with risk increasing markedly as BMI moves from the normal-weight to the overweight and obese categories.
Conclusion
Summarize your key findings.
Discuss the implications of your results.
References
Ali, Sakhawat, Rizwana Hussain, Rohaib A Malik, Raheema Amin, and Muhammad N Tariq. 2024. “Association of Obesity With Type 2 Diabetes Mellitus: A Hospital-Based Unmatched Case-Control Study.”Cureus 16 (1): 1–7. https://doi.org/10.7759/cureus.52728.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.”Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 2013. “National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010.” Vital and Health Statistics, Series 2, Number 161. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.”Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Gosho, Masahiko, Ryota Ishii, Kengo Nagashima, Hisashi Noma, and Kazushi Maruo. 2025. “Determining the prior mean in Bayesian logistic regression with sparse data: a nonarbitrary approach.”Journal of the Royal Statistical Society. Series C: Applied Statistics 74 (1): 126–41. https://doi.org/10.1093/jrsssc/qlae048.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A tutorial on Bayesian Normal linear regression.”Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian statistics and modelling.”Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.